The process of protein search for specific binding sites on DNA is fundamentally important since 
it marks the beginning of all major biological processes. We present a theoretical investigation that 
probes the role of DNA sequence symmetry, heterogeneity and chemical composition in the protein 
search dynamics. Using a discrete-state stochastic approach with a first-passage events analysis, 
which takes into account the most relevant physical-chemical processes, a full analytical description 
of the search dynamics is obtained. It is found that, contrary to existing views, the protein search 
is generally faster on DNA with more heterogeneous sequences. In addition, the search dynamics 
might be affected by the chemical composition near the target site. The physical origins of these 
phenomena are discussed. Our results suggest that biological processes might be effectively regulated 
by modifying chemical composition, symmetry and heterogeneity of a genome. 

Many biological processes are initiated by proteins binding to specific target sequences on DNA.
In particular, this process is responsible for transferring and maintaining the genetic
information contained in DNA. It was recognized long ago that finding these specific binding
sites could be quite a complicated task because of the large number of other nonspecific sites
($\sim 10^{6}$-$10^{9}$) and the low concentration of relevant proteins. But experiments suggest
that many proteins find their targets much faster than expected from 3D bulk diffusion estimates.
This surprising phenomenon is known as facilitated diffusion. Significant progress in explaining
facilitated diffusion has been achieved in recent years thanks to multiple experimental and
theoretical advances [3-33]. However, the detailed mechanisms of the protein search for targets
on DNA are still not well understood [3, 13].

It is now widely accepted that proteins searching for specific binding sites on DNA might, under
some conditions, alternate between 3D and 1D search modes. This means that the protein molecule
binds nonspecifically to DNA, slides along the chain, unbinds, and repeats the scanning cycle
several times until it finds the target. Recent single-molecule experiments that can visualize
the dynamics of individual molecules support this picture. These observations also underline the
critical role of protein-DNA interactions in facilitated diffusion. Since the DNA molecule is a
heterogeneous biopolymer, the sequence symmetry and chemical composition must be important
factors in the protein search for targets. However, how specifically the sequence heterogeneity
influences the protein search dynamics remains a controversial problem.

The protein search on random DNA sequences has been theoretically investigated before. By
comparing this process with motion in a random potential, it was shown that the heterogeneous
character of the chain leads to larger search times than in the homogeneous case. But it was
later argued that this result is not applicable to the protein search: it is an artifact of the
continuum approximation, which assumed that the protein can reach the target only via DNA
sliding and neglected 3D association and dissociation events. A more advanced computational
study of sequence heterogeneity also found that it usually slows down facilitated diffusion by
creating traps. But it was also suggested that properly positioned traps arranged in a funnel
shape near the target can accelerate the protein search. At the same time, it is not clear
whether such funnel distributions are observed in real systems. Furthermore, recent theoretical
studies by Lukatsky and coworkers suggested that sequence symmetry creates additional effective
interactions between DNA and protein molecules. Using methods of equilibrium statistical
mechanics, it was found that more homogeneous segments of DNA effectively attract proteins more
strongly than heterogeneous segments. However, the role of these effective interactions in the
protein search for targets on DNA has not been tested yet.

In this article, we present a theoretical approach that allows us to investigate explicitly the
effect of sequence heterogeneity on the protein search for targets on DNA. It is based on a
discrete-state stochastic method that takes into account the most relevant physical-chemical
processes of the protein search by analyzing first-passage events in the system. The advantage
of this method is that it provides a full analytical description of facilitated diffusion. One
of the main results of this approach is the development of a general dynamic phase diagram for
the target search. Three dynamic search regimes were identified, depending on the different
length scales in the system. For a protein sliding length $\lambda$ larger than the size of the
DNA chain $L$, the protein molecule always stays on DNA and performs a 1D search with
random-walk dynamics. This leads to quadratic scaling of the search times as a function of the
DNA length. When the sliding length is smaller than the length of DNA but larger than the
target size ($1 < \lambda < L$), the protein searches by combining 3D and 1D motions. In this
sliding regime, linear scaling of the search times is observed. A different dynamic phase is
found when the sliding length is smaller than the target size, $\lambda < 1$. Here the search is
accomplished only via 3D association and dissociation events, without sliding along the DNA
molecule. This also leads to linear scaling of the search times as a function of the DNA length.
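As a rough illustration of the three regimes described above, the sketch below classifies a parameter set by comparing the sliding length with the DNA length and the target size. The function names, the handling of the boundary values, the numerical rates in the demo, and the use of $\lambda = \sqrt{u/k_{off}}$ (the scanning length quoted in the figure captions) are assumptions made for illustration only.

```python
import math


def sliding_length(u, k_off):
    """Scanning length lambda = sqrt(u / k_off), in units of binding sites."""
    return math.sqrt(u / k_off)


def search_regime(lam, dna_length, target_size=1.0):
    """Name the dynamic search phase for a sliding length `lam`.

    Boundary values are assigned to the lower regime; this choice is arbitrary.
    """
    if lam > dna_length:
        return "random-walk"  # 1D search only, search time grows as L^2
    if lam > target_size:
        return "sliding"      # combined 3D + 1D search, search time grows as L
    return "jumping"          # 3D search only, search time grows as L


if __name__ == "__main__":
    L = 1000      # illustrative DNA length
    u = 1e5       # illustrative sliding rate (1/s)
    for k_off in (1e-3, 1e1, 1e7):
        lam = sliding_length(u, k_off)
        print(f"k_off = {k_off:g} 1/s, lambda = {lam:.3g}, regime = {search_regime(lam, L)}")
```

Varying `k_off` at fixed `u` is one simple way to sweep through all three phases.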

In our model, we consider a single DNA molecule with $L+1$ binding sites and a single protein
molecule, as shown in Fig. 1. One of the binding sites is a target, and for convenience we put
it in the middle of the chain, i.e., at $m = L/2 + 1$. To model the sequence heterogeneity, we
assume that each monomer in the DNA chain can be in one of two chemical states, A or B (see
Fig. 1). When the protein is bound to a segment A (B), it interacts with energy $\varepsilon_A$
($\varepsilon_B$), and $\varepsilon = \varepsilon_A - \varepsilon_B > 0$. This means that the
protein is attracted more strongly to the B sites. The protein molecule can diffuse along DNA
with the rate $u_A = u$ ($u_B = u e^{-\varepsilon}$, where $\varepsilon$ is measured in $k_B T$
units). Here we assume that, independently of the chemical state of the neighbors, moving out
of an A site is characterized by the rate $u_A$, while diffusion out of a B site is given by
$u_B$. The protein search starts in the solution, which we label as state 0. The protein
molecule can then bind to any site A or B on DNA with the corresponding rates
$k_{on}^{(A)} = k_{on}$ or $k_{on}^{(B)} = k_{on} e^{\theta\varepsilon}$. Similarly,
dissociations from the DNA chain are described by the rates $k_{off}^{(A)} = k_{off}$ and
$k_{off}^{(B)} = k_{off} e^{(\theta - 1)\varepsilon}$. Here the parameter $0 < \theta < 1$
specifies how the protein-DNA interaction energy is distributed between the association and
dissociation transitions. We also assume that binding to the target is given by
$k_{on}^{(T)} = k_{on}$. To test the effect of the sequence symmetry and heterogeneity, we
consider the protein search on two different types of DNA molecules (see Figs. 1b and 1c). One
of them consists of two homogeneous segments of only A and only B subunits separated by the
target (Fig. 1b). The other is a biopolymer with alternating A and B sites, as presented in
Fig. 1c. The block copolymer (Fig. 1b) has a more homogeneous sequence, while the alternating
polymers (Fig. 1c) are more heterogeneous. It is important to note that in both cases the
overall interaction between the protein and DNA is the same (the overall chemical composition
is identical), and thus our analysis probes only the effect of the heterogeneity. This is
different from previous computational studies [34].
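To make the model ingredients concrete, the following sketch builds the two sequence types and the per-site rates. It assumes the rate definitions $u_B = u e^{-\varepsilon}$, $k_{on}^{(B)} = k_{on} e^{\theta\varepsilon}$ and $k_{off}^{(B)} = k_{off} e^{(\theta-1)\varepsilon}$; the function names, the string encoding of a sequence (`"A"`, `"B"`, `"T"` for the target), and the way the alternating pattern is laid out outward from the target are hypothetical choices, not taken from the original analysis.

```python
import math


def site_rates(kind, u, k_on, k_off, eps, theta):
    """Return (sliding, association, dissociation) rates for a site of type kind."""
    if kind == "A":
        return u, k_on, k_off
    if kind == "B":
        return (u * math.exp(-eps),
                k_on * math.exp(theta * eps),
                k_off * math.exp((theta - 1.0) * eps))
    if kind == "T":
        return 0.0, k_on, 0.0  # the target is absorbing: no sliding or unbinding
    raise ValueError(f"unknown site type: {kind!r}")


def block_sequence(L):
    """Block copolymer: L/2 A sites, the target, then L/2 B sites."""
    half = L // 2
    return "A" * half + "T" + "B" * half


def alternating_sequence(L, left, right):
    """Alternating arms; `left` and `right` are the site types next to the target.

    With L/2 even, each arm holds equal numbers of A and B sites, so the overall
    composition matches block_sequence(L).
    """
    half = L // 2
    other = {"A": "B", "B": "A"}

    def arm(first):
        # Site types listed outward from the target, starting with `first`.
        return "".join(first if k % 2 == 0 else other[first] for k in range(half))

    return arm(left)[::-1] + "T" + arm(right)


if __name__ == "__main__":
    print(block_sequence(8))                  # AAAATBBBB
    print(alternating_sequence(8, "A", "A"))  # BABATABAB  (ATA)
    print(alternating_sequence(8, "A", "B"))  # BABATBABA  (ATB)
    print(alternating_sequence(8, "B", "B"))  # ABABTBABA  (BTB)
```

With these definitions the ratio of equilibrium binding constants, $(k_{on}^{(B)}/k_{off}^{(B)})/(k_{on}^{(A)}/k_{off}^{(A)})$, equals $e^{\varepsilon}$ for any $\theta$, which is a quick consistency check on the rates.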

FIG. 1. a) A general scheme of the protein search. The DNA chain consists of $L$ nonspecific
binding sites and one specific site that is the target of the search. A protein coming from the
solution can bind to any site on DNA with the association rate per segment given by
$k_{on}^{(i)}$, with $i$ = A or B. When attached, the protein can diffuse along the DNA with
the rate $u_i$ ($i$ = A or B), and it can dissociate into the solution with the rate
$k_{off}^{(i)}$ ($i$ = A or B). The search is finished when the protein binds to the target
site at the position $m = L/2 + 1$. b) A fully symmetric AB block copolymer DNA sequence.
c) Pseudo-random alternating sequences with different compositions near the target.

To describe the target search dynamics, let us introduce a function $F_n(t)$, which is defined
as the first-passage probability to reach the target if at $t = 0$ the protein was at the site
$n$ ($n = 1, 2, \ldots, L+1$ corresponds to starting on DNA, and $n = 0$ is for the bulk
solution). The
temporal evolution of this quantity can be described by 
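the backward master equations [23],

$$\frac{dF_n(t)}{dt} = u_n\left[F_{n-1}(t) + F_{n+1}(t)\right] + k_{off}^{(n)} F_0(t) - \left(2u_n + k_{off}^{(n)}\right) F_n(t), \tag{1}$$

for $1 < n < L + 1$, while in the solution we have

$$\frac{dF_0(t)}{dt} = \sum_{n=1}^{L+1} k_{on}^{(n)} F_n(t) - F_0(t) \sum_{n=1}^{L+1} k_{on}^{(n)}. \tag{2}$$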

It is convenient to analyze these equations in the Laplace space using the transformation
$\widetilde{F}_n(s) = \int_0^{\infty} F_n(t)\, e^{-st}\, dt$. Then all probabilities can be
found explicitly, which leads to the full dynamic description of the search process. The
details of the calculations are presented in the Supplementary Material. More specifically,
the mean first-passage time to reach the target starting from the solution is given by
$T_0 = -\left.\partial \widetilde{F}_0(s)/\partial s\right|_{s=0}$, and other dynamic
properties can also be written explicitly. This framework allows us to compare the search
dynamics on DNA with different sequences.
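For small chains, the mean first-passage time can also be obtained numerically without going through the Laplace transform. The sketch below is one possible way to do this, not the procedure of the Supplementary Material: it writes the stationary backward equations for the mean times $T_n$ (for each state, the sum over exits of rate times $(T_j - T_n)$ equals $-1$, with $T = 0$ at the target) and solves the resulting linear system by Gaussian elimination. Reflecting chain ends, the string encoding of the sequence, and all function names are assumptions made for illustration.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (dense, small n)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            if f != 0.0:
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / m[r][r]
    return x


def mean_search_time(sequence, u, k_on, k_off, eps, theta):
    """Mean time to reach the target 'T' starting from the solution (state 0)."""
    hop = {"A": u, "B": u * math.exp(-eps)}
    on = {"A": k_on, "B": k_on * math.exp(theta * eps), "T": k_on}
    off = {"A": k_off, "B": k_off * math.exp((theta - 1.0) * eps)}
    n_sites = len(sequence)
    size = n_sites + 1                      # index 0 is the solution
    a = [[0.0] * size for _ in range(size)]
    b = [-1.0] * size
    for i, kind in enumerate(sequence, start=1):
        a[0][i] += on[kind]                 # solution -> site i
        a[0][0] -= on[kind]
        if kind == "T":
            a[i][i] = 1.0                   # absorbing target: T_m = 0
            b[i] = 0.0
            continue
        for j in (i - 1, i + 1):            # sliding; chain ends are reflecting
            if 1 <= j <= n_sites:
                a[i][j] += hop[kind]
                a[i][i] -= hop[kind]
        a[i][0] += off[kind]                # dissociation back to the solution
        a[i][i] -= off[kind]
    return solve_linear(a, b)[0]


if __name__ == "__main__":
    pars = dict(u=1.0, k_on=0.1, k_off=0.01, eps=2.0, theta=0.5)
    print("block      :", mean_search_time("AAAATBBBB", **pars))
    print("alternating:", mean_search_time("BABATBABA", **pars))
```

The dense solver scales as the cube of the chain length, so this is only practical as a cross-check on short sequences; the analytical route described in the text has no such limitation.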

FIG. 2. Average times to find the target for the block copolymer DNA sequence as a function of
the scanning length $\lambda = \sqrt{u/k_{off}}$. The transition rates are $u = 10^{5}$ s$^{-1}$
and $k_{on} = 0.1$ s$^{-1}$. The DNA length is $L = 1000$, and we vary the energy difference
$\varepsilon$ (in units of $k_B T$) for the interaction between the protein and the A and B
subunits on DNA.

In the case of the more homogeneous block copolymer sequence (see Fig. 1b), the mean search
time is equal to

$$T_0 = \frac{k_{off} + k_{on}\left[\left(L/2 - P^{(A)}\right) + e^{\varepsilon}\left(L/2 - P^{(B)}\right)\right]}{k_{on} k_{off}\left(1 + P^{(A)} + e^{\theta\varepsilon} P^{(B)}\right)}, \tag{3}$$

where

$$P^{(i)} = \frac{x_i^{1-L/2} - x_i^{1+L/2}}{(1 - x_i)\left(x_i^{-L/2} + x_i^{1+L/2}\right)}, \qquad x_i = \frac{2u_i + k_{off}^{(i)} - \sqrt{\left(2u_i + k_{off}^{(i)}\right)^2 - 4u_i^2}}{2u_i}, \tag{4}$$

for $i$ = A and B. The results are presented in Fig. 2. Again, three dynamic search phases are
clearly observed. Increasing the strength of interactions with B subunits makes the search in
the random-walk regime much slower. This is because the protein gets effectively trapped on B
sites for $\lambda > L$.
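The closed-form result for the block copolymer can be evaluated directly, for example to reproduce curves of the kind described for Fig. 2. The sketch below assumes that Eq. (3) reads $T_0 = \{k_{off} + k_{on}[(L/2 - P^{(A)}) + e^{\varepsilon}(L/2 - P^{(B)})]\}/[k_{on}k_{off}(1 + P^{(A)} + e^{\theta\varepsilon}P^{(B)})]$ and that $P^{(i)} = (x_i^{1-L/2} - x_i^{1+L/2})/[(1-x_i)(x_i^{-L/2} + x_i^{1+L/2})]$, with $x_i$ the smaller root of $u_i x^2 - (2u_i + k_{off}^{(i)})x + u_i = 0$. Both $P^{(i)}$ and $x_i$ are rewritten in algebraically equivalent forms that avoid overflow and cancellation; the function names and this rewriting are illustrative choices.

```python
import math


def scanning_sum(x, L):
    """P^(i) from Eq. (4), with numerator and denominator multiplied by x^(L/2)."""
    return x * (1.0 - x ** L) / ((1.0 - x) * (1.0 + x ** (L + 1)))


def smaller_root(u_i, k_off_i):
    """x_i from Eq. (4); uses x = 2u / (a + sqrt(a^2 - 4u^2)) since the roots multiply to 1."""
    a = 2.0 * u_i + k_off_i
    return 2.0 * u_i / (a + math.sqrt(a * a - 4.0 * u_i * u_i))


def block_copolymer_search_time(L, u, k_on, k_off, eps, theta):
    """Mean search time T_0 of Eq. (3) for the A-block / target / B-block chain."""
    p_a = scanning_sum(smaller_root(u, k_off), L)
    p_b = scanning_sum(
        smaller_root(u * math.exp(-eps), k_off * math.exp((theta - 1.0) * eps)), L
    )
    numerator = k_off + k_on * ((L / 2 - p_a) + math.exp(eps) * (L / 2 - p_b))
    denominator = k_on * k_off * (1.0 + p_a + math.exp(theta * eps) * p_b)
    return numerator / denominator


if __name__ == "__main__":
    # Illustrative sweep over the scanning length lambda = sqrt(u / k_off).
    u, k_on, L = 1e5, 0.1, 1000
    for lam in (0.1, 10.0, 1e4):
        k_off = u / lam ** 2
        t0 = block_copolymer_search_time(L, u, k_on, k_off, eps=5.0, theta=0.5)
        print(f"lambda = {lam:g}: T_0 = {t0:.4g} s")
```

For the shortest chain, $L = 2$ with all rates equal to 1 and $\varepsilon = 0$, the formula gives $T_0 = 1$, which matches a direct solution of the three-site problem by hand.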

FIG. 3. Comparison of the search times for alternating sequences with random sequences
generated in Monte Carlo computer simulations. The transition rates are $u = 10^{5}$ s$^{-1}$
and $k_{on} = 0.1$ s$^{-1}$. The DNA length is $L = 1000$, the loading parameter is
$\theta = 0.5$, and two different interaction strengths, $\varepsilon = 10$ and
$\varepsilon = 5$, are probed.

Similar expressions for the mean first-passage times can be found for alternating AB DNA
chains, as shown in the Supplementary Material. Here we use pseudo-random alternating
sequences to mimic real random sequences because analytical results can be obtained for them.
We tested this approximation in Monte Carlo computer simulations by generating random
sequences, and one can see from Fig. 3 that the assumption is fully justified. Another
interesting observation from Fig. 3 is that the chemical composition near the target might
also affect the search dynamics. This effect is found only in the intermediate sliding regime
($1 < \lambda < L$) because in this case the probability fluxes to the target site from the
solution and from the DNA are comparable. Modifying the composition of the sites near the
target can change the amount of flux coming from the DNA chain. The flux is larger for BTB
sequences (two B subunits around the target), leading to smaller search times. This is because
the protein molecule is attracted more strongly to B sites, so it has a higher probability of
being found there and eventually reaching the target. At the same time, the flux is smaller
for ATA sequences (two A subunits around the target) with weaker interactions at A sites,
which yields slower search dynamics. For ATB sequences, as expected, intermediate dynamics is
observed.
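A Monte Carlo check of the kind mentioned above can be sketched with a standard continuous-time (Gillespie-type) simulation. The simulation details are not given in the text, so everything below is an assumption for illustration: the random sequences are generated with equal numbers of A and B sites, the chain ends are reflecting, and the function names and the string encoding of sequences are hypothetical.

```python
import itertools
import math
import random


def random_sequence(L, rng):
    """Random arrangement of L/2 A sites and L/2 B sites with the target in the middle."""
    half = L // 2
    sites = ["A"] * half + ["B"] * half
    rng.shuffle(sites)
    return "".join(sites[:half]) + "T" + "".join(sites[half:])


def simulate_search_time(sequence, u, k_on, k_off, eps, theta, rng):
    """One stochastic trajectory: time for a protein in solution to reach the target."""
    hop = {"A": u, "B": u * math.exp(-eps)}
    on = {"A": k_on, "B": k_on * math.exp(theta * eps), "T": k_on}
    off = {"A": k_off, "B": k_off * math.exp((theta - 1.0) * eps)}
    cum_on = list(itertools.accumulate(on[s] for s in sequence))
    sites = range(len(sequence))
    last = len(sequence) - 1
    t = 0.0
    while True:
        # Protein in solution: wait for binding, then pick the landing site.
        t += rng.expovariate(cum_on[-1])
        pos = rng.choices(sites, cum_weights=cum_on)[0]
        while sequence[pos] != "T":
            kind = sequence[pos]
            left = hop[kind] if pos > 0 else 0.0
            right = hop[kind] if pos < last else 0.0
            total = left + right + off[kind]
            t += rng.expovariate(total)
            r = rng.random() * total
            if r < left:
                pos -= 1
            elif r < left + right:
                pos += 1
            else:
                break          # dissociated: start a new search cycle
        else:
            return t           # inner loop ended because the target was reached


def mc_mean_search_time(sequence, n_runs, seed=0, **rates):
    """Average search time over n_runs independent trajectories."""
    rng = random.Random(seed)
    total = sum(simulate_search_time(sequence, rng=rng, **rates) for _ in range(n_runs))
    return total / n_runs


if __name__ == "__main__":
    seq = random_sequence(20, random.Random(1))
    rates = dict(u=1.0, k_on=0.1, k_off=0.05, eps=2.0, theta=0.5)
    print(seq, mc_mean_search_time(seq, 2000, seed=2, **rates))
```

Averaging over many independently generated sequences, and not only over trajectories on one sequence, would be needed for a comparison with the pseudo-random alternating chains.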

FIG. 4. The ratio of the search times for the alternating DNA sequences and for the block
copolymer DNA sequences as a function of the scanning length $\lambda = \sqrt{u/k_{off}}$.
Three different chemical compositions near the target are distinguished, namely ATA, ATB, and
BTB. The transition rates are $u = 10^{5}$ s$^{-1}$ and $k_{on} = 0.1$ s$^{-1}$. The DNA length
is $L = 1000$, the loading parameter is $\theta = 0.5$, and the energy difference of
interactions for the protein with A and B sites is $\varepsilon = 5$.

Now we can quantify the effect of sequence heterogeneity on the protein search for specific
binding sites on DNA. The results in Fig. 4 present a ratio of the search times for block
copolymer sequences, which are less heterogeneous, and for various alternating sequences,
which are more heterogeneous, as a function of the sliding length on DNA. One can see that the
effect of the sequence heterogeneity depends on the nature of the dynamic search phase. In the
jumping regime ($\lambda < 1$), the symmetry of the sequence does not play any role. This is
because in this case the process takes place only via associations and dissociations (3D
search), and the structure of the DNA chain is not important. The situation is different for
the intermediate sliding regime (3D+1D search, $1 < \lambda < L$), where in most cases the
search on alternating sequences is faster. This can be explained by noting that the search
time in this dynamic phase is proportional to $L/\lambda$, which gives the average number of
search cycles before the protein finds the target. In the block copolymer sequence the protein
mostly comes to the target from the B segment because of stronger interactions. In the
alternating sequences the protein can reach the target from both sides. It can be shown
analytically (see the Supplementary Material) that the scanning length on the alternating
segment is larger than the scanning length on the B segment, i.e., $\lambda_{AB} > \lambda_B$.
The search is then faster for the alternating sequence because $L/\lambda_{AB} < L/\lambda_B$.
The only deviation from this picture is found in ATA sequences, where for a small range of
parameters the search is slower than in the block copolymer sequence. The effect of the
chemical composition near the target, as discussed above, is responsible for this. In the
random-walk regime (1D search, $\lambda > L$), the effect of the sequence heterogeneity is even
stronger: the protein molecule finds the specific binding site up to 2 times faster on more
heterogeneous DNA chains. To understand this behavior, we note that in this case the mean
first-passage time to reach the target is a sum of residence times on the DNA sites. Because
the target is in the middle of the chain, the mean time to reach the target for the block
copolymer sequence will be $T_0 \sim (L/4)\, t_B$, where $t_B$ is the residence time at a B
site. The average starting position of the protein is $L/4$ sites away from the target. For
the alternating sequences, the average distance to the target is the same, but the chemical
composition of the intermediate sites is different, yielding
$T_0 \sim (L/8)\, t_B + (L/8)\, t_A$. The protein spends much less time on A subunits, and this
leads to a faster search for alternating DNA sequences. For $t_A \ll t_B$ this also explains
the factor of 2 in the search speed. In this case, the B subunits can be viewed as traps.
Thus, in dynamic phases where the structure of DNA is important, sequence heterogeneity almost
always accelerates the protein search for targets.
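The factor of 2 in the random-walk regime follows directly from the two estimates given in the text. The short calculation below only rearranges them; the last line additionally assumes that the residence times scale as $t_i \propto 1/u_i$ in this regime, which is an illustrative assumption and not a statement taken from the analysis.

```latex
\[
  \frac{T_0^{\mathrm{block}}}{T_0^{\mathrm{alt}}}
  \simeq \frac{(L/4)\, t_B}{(L/8)\, t_B + (L/8)\, t_A}
  = \frac{2}{1 + t_A/t_B}
  \;\longrightarrow\; 2
  \qquad (t_A \ll t_B).
\]
% Assumption: t_A / t_B = u_B / u_A = e^{-\varepsilon}. For \varepsilon = 5 this gives
\[
  \frac{2}{1 + e^{-5}} \approx 1.99 .
\]
```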

In conclusion, we presented a theoretical analysis of the role of DNA sequence symmetry and
heterogeneity in the protein search process. Using analytical solutions of a discrete-state
stochastic approach that accounts for the most important physical-chemical processes in the
system, we obtained a full description of the search dynamics. It is found that sequence
heterogeneity is a crucial factor in facilitated diffusion. Unlike previous theoretical and
computational models, our approach predicts that sequence heterogeneity mostly accelerates the
search. The mechanism of this phenomenon depends on the nature of the search regime: it is
either the smaller number of search cycles or the smaller number of trapping sites on the path
to the target. We also found that in the dynamic phase where the specific binding site can be
reached both from the solution and from the DNA chain, the chemical composition near the
target might influence the search dynamics. The search is faster if the target is surrounded
by subunits that interact more strongly with the protein, which gives it more opportunities to
reach the target. Our theoretical results not only clarify the fundamental physics of the
protein search dynamics, but also suggest that biological processes can be effectively
regulated by modifying the sequence symmetry and heterogeneity of DNA, as well as the chemical
composition near the targets. Experiments to test these predictions should provide a better
understanding of the microscopic mechanisms of complex biological processes.